{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3B. Data Modeling: Seasonality\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've explored what the seasonality dataset looks like, the goal is to be able to make modifications to our average prediction from the initial regression to account for seasonality of pricing. This basically means that we will make +/- modifications to our average prediction based on the the day that we are projecting the data for. We can also do similar things for months of the year as well as holidays. This will hopefully reduce the residuals because the seasonality of the pricing data would cause some correlation among the residuals (based on time of the year) violating a lot of the OLS assumptions. We use averages as a way to explore seasonality. More advanced seasonality measurements could be used if we had more data over several years (where we could build an ARIMA or SARIMA model)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib\n",
    "import seaborn as sb\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.cm as cmx\n",
    "import matplotlib.colors as colors\n",
    "from sklearn import linear_model\n",
    "import sklearn.metrics as metrics\n",
    "from sklearn.grid_search import GridSearchCV\n",
    "from sklearn import preprocessing\n",
    "from sklearn.cross_validation import train_test_split\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.linear_model import LinearRegression as Lin_Reg\n",
    "from sklearn.linear_model import RidgeCV\n",
    "from sklearn.linear_model import LassoCV\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.ensemble import GradientBoostingRegressor\n",
    "import pylab\n",
    "import scipy.stats as stats\n",
    "%matplotlib inline\n",
    "import datetime as dt\n",
    "from datetime import datetime"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Process Overview:\n",
    "The general idea behind this analysis is as follows: we aggregate prices by weekday for each listing. Then, we normalize each listing's price by the monday price to find an average multiplier for each listing for each day. Then, for each day we average across all listings to get a final average multiplier for each day. Lastly, we compare these predictions to a subset of the listings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mon</th>\n",
       "      <th>Tue</th>\n",
       "      <th>Wed</th>\n",
       "      <th>Thu</th>\n",
       "      <th>Fri</th>\n",
       "      <th>Sat</th>\n",
       "      <th>Sun</th>\n",
       "      <th>listing_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>3604481.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>2949128.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0.991826</td>\n",
       "      <td>0.991826</td>\n",
       "      <td>0.999846</td>\n",
       "      <td>1.138965</td>\n",
       "      <td>1.138965</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>4325397.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>4325398.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0.991494</td>\n",
       "      <td>0.994952</td>\n",
       "      <td>1.004395</td>\n",
       "      <td>1.015027</td>\n",
       "      <td>1.011263</td>\n",
       "      <td>0.99895</td>\n",
       "      <td>3426149.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Mon       Tue       Wed       Thu       Fri       Sat      Sun  listing_id\n",
       "0    1  1.000000  1.000000  1.000000  1.000000  1.000000  1.00000   3604481.0\n",
       "1    1  1.000000  1.000000  1.000000  1.000000  1.000000  1.00000   2949128.0\n",
       "2    1  0.991826  0.991826  0.999846  1.138965  1.138965  1.00000   4325397.0\n",
       "3    1  1.000000  1.000000  1.000000  1.000000  1.000000  1.00000   4325398.0\n",
       "4    1  0.991494  0.994952  1.004395  1.015027  1.011263  0.99895   3426149.0"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Importing Datafile\n",
    "results_nona = pd.read_csv('../datasets/seasonality_tomodel.csv')\n",
    "results_multiplier = pd.read_csv('../datasets/seasonality_tomodel.csv')\n",
    "b=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']\n",
    "for i in b[1:7]:\n",
    "    results_multiplier[i] = results_multiplier[i]/results_multiplier['Mon']\n",
    "results_multiplier['Mon']= 1\n",
    "b=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']\n",
    "for i in b[1:7]:\n",
    "    results_multiplier[i] = results_multiplier[i]/results_multiplier['Mon']\n",
    "results_multiplier['Mon']= 1\n",
    "results_multiplier.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We see that the dataframe now contains a multiplier for each day of the week for each listing. Now we take an average for each day(averaging across all listings) to see an average multiplier value for each day"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Fri': 1.0306309142782333,\n",
       " 'Mon': 1.0,\n",
       " 'Sat': 1.0306744793102525,\n",
       " 'Sun': 1.0009269127770359,\n",
       " 'Thu': 1.0003671343632679,\n",
       " 'Tue': 0.99980071473169119,\n",
       " 'Wed': 0.9998361380714712}"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "multiplier = dict.fromkeys(b)\n",
    "for index,i in enumerate(multiplier):\n",
    "    multiplier[i]=results_multiplier.mean()[i]\n",
    "multiplier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are very much in line with what we saw earlier in our seasonality-exploration file. Monday and Tuesday see a slight dip in their prices(99.9%) while Friday and Saturday see a sizable increase in prices (103%). These are thus the numbers we will be using to apply seasonality to the averages from our previous models. Here, we apply it to RidgeCV the best one that we found."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predicting Prices Using Our Seasonality Averages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, it is important to test the performance of the averages we arrived at. Here we seek to utilize the RidgeCV regression-- one of the best ones from the models we ran-- and apply seasonality training to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#We import the data and rerun the RidgeCV Regression\n",
    "data = pd.read_csv('../datasets/listings_clean.csv')\n",
    "data.head()\n",
    "# split into x and y (note that we do not include id and host_id as predictors)\n",
    "x = data.iloc[:, 2:-2]\n",
    "y = data.iloc[:, -2]\n",
    "y_log = data.iloc[:, -1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "reg_params = 10.**np.linspace(-10, 5, 10)\n",
    "RidgeCV_model = RidgeCV(alphas=reg_params, fit_intercept=True, cv=5)\n",
    "RidgeCV_model.fit(x,y_log)\n",
    "sample = results_nona.sample(frac=0.4,axis=0)\n",
    "# some of the id's in the sample can't be found. So at the end we readjust the sample dataframe too so they have the same entries\n",
    "sample_variables=data.loc[data['id'].isin(sample['listing_id'])]\n",
    "sample_variables.head(5)\n",
    "sample_variables.shape\n",
    "sample = sample.loc[sample['listing_id'].isin(sample_variables['id'])]\n",
    "X_sample = sample_variables.iloc[:, 2:-2]\n",
    "new_predictions = sample.copy()\n",
    "new_predictions.loc[:,0:7]=0\n",
    "new_predictions['Mon']=np.exp(RidgeCV_model.predict(X_sample))\n",
    "for i in b[1:]:\n",
    "    new_predictions[i]=np.exp(RidgeCV_model.predict(X_sample))*multiplier[i]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 177,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mon</th>\n",
       "      <th>Tue</th>\n",
       "      <th>Wed</th>\n",
       "      <th>Thu</th>\n",
       "      <th>Fri</th>\n",
       "      <th>Sat</th>\n",
       "      <th>Sun</th>\n",
       "      <th>listing_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>875</th>\n",
       "      <td>38.752293</td>\n",
       "      <td>38.744570</td>\n",
       "      <td>38.745943</td>\n",
       "      <td>38.766520</td>\n",
       "      <td>39.939311</td>\n",
       "      <td>39.941000</td>\n",
       "      <td>38.788213</td>\n",
       "      <td>4526313.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>787</th>\n",
       "      <td>117.820157</td>\n",
       "      <td>117.796677</td>\n",
       "      <td>117.800850</td>\n",
       "      <td>117.863412</td>\n",
       "      <td>121.429096</td>\n",
       "      <td>121.434229</td>\n",
       "      <td>117.929366</td>\n",
       "      <td>2097795.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2093</th>\n",
       "      <td>93.586445</td>\n",
       "      <td>93.567795</td>\n",
       "      <td>93.571110</td>\n",
       "      <td>93.620804</td>\n",
       "      <td>96.453083</td>\n",
       "      <td>96.457161</td>\n",
       "      <td>93.673192</td>\n",
       "      <td>3409586.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2721</th>\n",
       "      <td>95.829400</td>\n",
       "      <td>95.810303</td>\n",
       "      <td>95.813698</td>\n",
       "      <td>95.864583</td>\n",
       "      <td>98.764742</td>\n",
       "      <td>98.768917</td>\n",
       "      <td>95.918226</td>\n",
       "      <td>3027838.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>246.199927</td>\n",
       "      <td>246.150863</td>\n",
       "      <td>246.159585</td>\n",
       "      <td>246.290316</td>\n",
       "      <td>253.741256</td>\n",
       "      <td>253.751982</td>\n",
       "      <td>246.428133</td>\n",
       "      <td>2893512.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Mon         Tue         Wed         Thu         Fri         Sat  \\\n",
       "875    38.752293   38.744570   38.745943   38.766520   39.939311   39.941000   \n",
       "787   117.820157  117.796677  117.800850  117.863412  121.429096  121.434229   \n",
       "2093   93.586445   93.567795   93.571110   93.620804   96.453083   96.457161   \n",
       "2721   95.829400   95.810303   95.813698   95.864583   98.764742   98.768917   \n",
       "2007  246.199927  246.150863  246.159585  246.290316  253.741256  253.751982   \n",
       "\n",
       "             Sun  listing_id  \n",
       "875    38.788213   4526313.0  \n",
       "787   117.929366   2097795.0  \n",
       "2093   93.673192   3409586.0  \n",
       "2721   95.918226   3027838.0  \n",
       "2007  246.428133   2893512.0  "
      ]
     },
     "execution_count": 177,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_predictions.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mon</th>\n",
       "      <th>Tue</th>\n",
       "      <th>Wed</th>\n",
       "      <th>Thu</th>\n",
       "      <th>Fri</th>\n",
       "      <th>Sat</th>\n",
       "      <th>Sun</th>\n",
       "      <th>listing_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>875</th>\n",
       "      <td>100.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>4526313.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>787</th>\n",
       "      <td>210.0</td>\n",
       "      <td>210.0</td>\n",
       "      <td>210.0</td>\n",
       "      <td>210.0</td>\n",
       "      <td>210.0</td>\n",
       "      <td>210.0</td>\n",
       "      <td>210.0</td>\n",
       "      <td>2097795.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2093</th>\n",
       "      <td>75.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>95.0</td>\n",
       "      <td>95.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>3409586.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2721</th>\n",
       "      <td>130.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>3027838.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>80.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>2893512.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Mon    Tue    Wed    Thu    Fri    Sat    Sun  listing_id\n",
       "875   100.0  100.0  100.0  100.0  100.0  100.0  100.0   4526313.0\n",
       "787   210.0  210.0  210.0  210.0  210.0  210.0  210.0   2097795.0\n",
       "2093   75.0   75.0   75.0   75.0   95.0   95.0   75.0   3409586.0\n",
       "2721  130.0  130.0  130.0  130.0  130.0  130.0  130.0   3027838.0\n",
       "2007   80.0   80.0   80.0   80.0   80.0   80.0   80.0   2893512.0"
      ]
     },
     "execution_count": 178,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see already from the head that the output of our seasonality data may not yield the best results. The top data frame is our projections versus the lower which is the actual prices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 179,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "72.1070818480257"
      ]
     },
     "execution_count": 179,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "metrics.median_absolute_error(sample.iloc[:,:-1].values.flatten(), new_predictions.iloc[:,:-1].values.flatten())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overall, our results did suffer from greater error than our original RidgeCV regression, as we experience a median absolute error of $\\$72.10$. We came to our predictions by using our RidgeCV to predict an average price for our sample of the total listings( we used $40\\%$ of the total dataset to help it run faster). Then, we multiplied that average by the appropriate day multiplier to predict the price by day of the week. Lastly, we looked at the errors this generated versus the actual prices by day of the week that we know. This is for three reasons-- many of the listings still do not incorporate day-variations in their prices, large rather than small price difference between days of the week, and inaccuracies within the original RidgeCV predictions.\n",
    "To address the first inaccuracy, we see that most of the listings don't vary prices over time. This can be seen by the extremely clean averages that are the same each day for every listing. This is strongly suggestive that the lister just uses one set price no matter what time of the year it is. If there were some variation, then the averages would be a lot more messy. This directly leads to the second point. Because so many people do not use dynamic, day-based pricing, the overall effect we see and apply is very minimal (with a max of $103\\%$ on Friday and Saturday as we said earlier). However, when we look at the strategies of people who do employ dynamic pricing, the price changes are never $3\\%$. For example, we see ID 3409586 increases its average listing price from \\$75 to \\$95 on Fridays and Saturdays. Thus, even for listings that do use dynamic pricing, the multipliers we applied are less than the shifts in prices actually used. Lastly, a lot of the error here is in predicting the average price from our RidgeCV regression. In the future, by better predicting the average price, it will also dramatically increase the accuracy of our seasonality analysis. However, all in all this was a great step forward."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using ARIMA time series models for future forecasting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "As inspired by [this Duke webpage about using ARIMA models for time series forecasting](https://people.duke.edu/~rnau/seasarim.htm), we feel like this could be a model with exploring with our AirBnB data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take the cross section of a price on each day of the year and then map that time series."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<img src=\"../img/average price time series.png\" width=\"400\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that this time series is not stationary, thus we take a first order difference and see that the time series is a lot more stationary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../img/1 difference.png\" width=\"400\">\n",
    "\n",
    "This looks a lot better so then we can look at the ACF and PACF graphs to estimate the parameters of the ARIMA model. The ACF supports our analysis from the averages in that there is high autocorrelation in the multiples of lag=7.\n",
    "\n",
    "<img src=\"../img/acf.png\" width=\"400\">\n",
    "<img src=\"../img/PACF.png\" width=\"400\">\n",
    "\n",
    "Lastly, we use the forecast package in R to have it choose the best model based on lowest AIC. We ultimately come up with a model of ARIMA(4,0,3) as highlighted by this output from R\n",
    "<img src=\"../img/arima model.png\" width=\"400\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This was just a beginning exploration, however with more comprehensive data, this type of modeling could prove to be very powerful for understanding the seasonal nature of the pricing data. With only one year at our disposal though, it is hard to project forward prices and understand annual pricing trends. However, this is very promising!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Seasonality is perhaps the most promising area of the entire project because it shows that many Airbnb listers are not taking advantage of dynamic pricing by the day of the week, something that is important to establish optimal pricing. Already, there are some promising results-- people should price Friday and Saturday the highest and Tuesday and Wednesday the lowest. With further refinements to our model, such as looking at various seasonal time series models, hopefully people can look at the trends and be able to price their AirBnB listings more appropriately with more concrete percentage change suggestions.\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
